Abstract

We have participated in the
Senseval-2 English tasks (all
words and lexical sample) with 
an unsupervised system based on 
mutual information measured over 
a large corpus (277 million words) 
and some additional heuristics. A 
supervised extension of the system 
was also presented to the lexical 
sample task. 

Our system scored first among
unsupervised systems in both tasks:
56.9% recall in all words, 40.2% in
lexical sample. This is slightly worse
than the first sense heuristic for all
words and 3.6% better for the lexical
sample, a strong indication that
unsupervised Word Sense Disambiguation
remains a serious challenge.

1 Introduction 

We advocate researching unsupervised
techniques for Word Sense Disambiguation
(WSD). Supervised techniques offer better
results in general, but the drawbacks, such as the
problem of developing reliable training data,
are very considerable. Also, there is probably
more to WSD than blind machine learning
(a typical approach, although such systems
produce interesting baselines).



Within the unsupervised paradigm, we
are interested in performing in-depth
measures of the disambiguation potential of
different sources of information. We
have previously investigated the
informational value of semantic distance measures
in (Fernandez-Amoros et al., 2001). For
Senseval-2, we have turned to investigating
pure co-occurrence information as a source of
disambiguation evidence. In essence, our
system computes a matrix of mutual information
for a fixed vocabulary and applies it to weight
co-occurrence counts between sense and
context characteristic vectors.

In the next section we describe the process
of constructing the relevance matrix. In
Section 3 we present the particular heuristics used
for the competing systems. In Section 4 we
show the results by system and heuristic, along with
some baselines for comparison. Finally, in the
last sections we draw some conclusions about
the exercise.

2 The Relevance matrix 

2.1 Corpus processing 

Before building our systems we developed
a resource we have called the relevance matrix.
The raw data used to build the matrix comes
from Project Gutenberg (PG)^1.

At the time the matrix was created,
PG consisted of more than 3000 books of
diverse genres. We adapted these books
for our purpose: first, language identification
was used to keep only the books written in English;
then we stripped off the disclaimers. The
result is a collection of around 1.3 GB of plain
text.

Finally, we tokenize, lemmatize, strip
punctuation and stop words, and detect numbers
and proper nouns.

^1 http://promo.net/pg

2.2 Co-occurrence matrix

We have built a vocabulary of the 20000 most
frequent words (or labels, since we replaced
every detected proper noun with the label
PROPER_NOUN and every detected number
with NUMBER) in the text, and a symmetric
co-occurrence matrix between these words
within a context of 61 words (we thought a
broad context of radius 30 would be appropriate
since we are trying to capture vague
semantic relations).
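
The counting step can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be a flat list of lemmas in which proper nouns and numbers have already been replaced by their labels, and the function name `cooccurrence_counts` and the dictionary-of-pairs representation are hypothetical choices, not a description of the original implementation.

```python
from collections import Counter

def cooccurrence_counts(tokens, vocabulary, radius=30):
    """Count symmetric co-occurrences of vocabulary words within `radius` tokens.

    Keys are pairs (a, b) with a <= b, so each unordered pair is stored once.
    """
    counts = Counter()
    for i, a in enumerate(tokens):
        if a not in vocabulary:
            continue
        # Look only to the right; symmetry covers the left-hand neighbours.
        for b in tokens[i + 1 : i + 1 + radius]:
            if b in vocabulary and b != a:
                counts[(a, b) if a <= b else (b, a)] += 1
    return counts

# Toy input with a small radius so the result is easy to check by hand.
tokens = ["ship", "sail", "sea", "NUMBER", "sail", "PROPER_NOUN"]
vocab = {"ship", "sail", "sea", "NUMBER", "PROPER_NOUN"}
counts = cooccurrence_counts(tokens, vocab, radius=2)
```

With `radius=30` each word is paired with the 30 tokens on either side, which matches the 61-word context described above.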

2.3 Relevance matrix 

In a second step, we have built another
symmetric matrix, which we have called the relevance
matrix, using a mutual information measure
between the words (or labels), so that for two
words a and b, the entry for them would be
$\frac{P(a \cap b)}{P(a) P(b)}$, where $P(a)$ is the probability of
finding the word a in a random context of a given
size and $P(a \cap b)$ is the probability of finding both
a and b in a random context of the fixed size.
We've introduced a threshold of 2 below which
we set the entry to zero for practical purposes.
We think that this is a valuable resource that
could be of interest for many applications
other than WSD. Also, it can only grow
in quality, since at the time of writing this
report the data in PG has almost doubled in
size.
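
A minimal Python sketch of this step, assuming the corpus has already been cut into fixed-size contexts and that each context is reduced to the set of vocabulary words it contains. The function name `relevance_matrix` and the toy contexts are hypothetical; only the ratio and the threshold of 2 come from the description above.

```python
from collections import Counter
from itertools import combinations

def relevance_matrix(contexts, threshold=2.0):
    """Return {(a, b): P(a and b) / (P(a) P(b))} for pairs at or above `threshold`.

    `contexts` is an iterable of word collections, each standing for one
    fixed-size context. Pairs are stored once, with a < b.
    """
    contexts = [set(c) for c in contexts]
    n = len(contexts)
    word_count = Counter()
    pair_count = Counter()
    for ctx in contexts:
        word_count.update(ctx)
        pair_count.update(combinations(sorted(ctx), 2))
    relevance = {}
    for (a, b), n_ab in pair_count.items():
        # (n_ab / n) / ((n_a / n) * (n_b / n)) simplifies to n_ab * n / (n_a * n_b)
        ratio = n_ab * n / (word_count[a] * word_count[b])
        if ratio >= threshold:  # entries below the threshold are treated as zero
            relevance[(a, b)] = ratio
    return relevance

contexts = [
    {"ship", "sail"},
    {"ship", "sail"},
    {"bank", "money"},
    {"bank", "river"},
    {"tree", "leaf"},
    {"money", "tree"},
]
R = relevance_matrix(contexts)
```

In this toy example the pair (sail, ship) gets the value 3.0 and is kept, while (bank, money) gets 1.5 and is dropped by the threshold. Storing only the surviving pairs keeps the matrix sparse, which matters given the memory requirements mentioned later.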

3 Cascade of heuristics 

We have developed a very simple language in
order to systematize the experiments. This
language allows the construction of WSD
systems composed of different heuristics that are
applied in cascade, so that each word to be
disambiguated is presented to the first heuristic,
and if it fails to disambiguate, then the word
is passed on to the second heuristic and so on.
We can have several such systems running in
parallel for efficiency reasons (the matrix has
high memory requirements). Next we show
the heuristics we have considered to build the
systems:

• Monosemous expressions. 

Monosemous expressions are simply
unambiguous words in the case of the all
words English task. In the case of the
lexical sample English task, however, the
annotations include multiword expressions.
We have implemented a multiword term
detector that considers the multiword
terms from WordNet's index.sense file and
detects them in the test file using a
multilevel backtracking algorithm that takes
into account the inflected and base forms
of the components of a particular
multiword in order to maximize multiword
detection. We tested this algorithm against
the PG and found millions of these
multiword terms.

We restricted ourselves to the multiwords
already present in the training file since
there are, apparently, multiword
expressions that were overlooked during
manual tagging (for instance, the WordNet
expression 'the_good_old_days' is not
hand-tagged as such in the test files).

• Statistical filter

WordNet comes with a file, cntlist,
literally a 'file listing number of times each
tagged sense occurs in a semantic
concordance', so we use this to compute the
relative probability of a sense given a
word (approximate in the case of
collections other than SemCor). Using this
information, we eliminated the senses that
had a probability under 10%, and if only
one sense remains we choose it.
Otherwise we go on to the next heuristic.
In other words, we didn't apply complex
techniques to words that are highly
skewed in meaning^2.

^2 Some people may argue that this is a supervised
approach. In our opinion, the cntlist information does
not make a system supervised per se, because a) it is
standard information provided as part of the dictionary
and b) we don't use the examples to feed or train any
procedure.

• Relevance filter

This heuristic makes use of the relevance
matrix. In order to assign a score to a
sense, we count the co-occurrences of words
in the context of the word to be
disambiguated with the words in the
definition of the sense (the WordNet gloss,
tokenized, lemmatized and stripped
of stop words and punctuation signs),
weighting each co-occurrence by the entry
in the relevance matrix for the word to
be disambiguated and the word whose
co-occurrences are being counted. That is, if s
is a sense of the word a whose definition
is S, and C is the context in which a is
to be disambiguated, then the score for s
would be:

$$\mathrm{score}(s) = \sum_{w \in C} R_{wa} \, \mathrm{freq}(w, C) \, \mathrm{freq}(w, S) \, \mathrm{idf}(w, a)$$

where $\mathrm{idf}(w, a) = \log \frac{N}{d_w}$, with $N$ being
the number of senses for word a and $d_w$
the number of sense glosses in which w
appears. $\mathrm{freq}(w, C)$ is the frequency of
word w in the context C and $\mathrm{freq}(w, S)$ is
the frequency of w in the sense gloss S.

The idea is to prime the occurrences of
words that are relevant to the word being
disambiguated and give low credit
(possibly none) to the words that are
incidentally used in the context.

Also, in the all words task (where POS
tags from the TreeBank are provided) we
have considered only the context words
that have a POS tag compatible with that
of the word being disambiguated. By
compatible we mean nouns and nouns,
nouns and verbs, nouns and adjectives,
verbs and verbs, verbs and adverbs, and
vice versa. Roughly speaking, words that
can have an intra-phrase relation.

We also filtered out senses with low values
in the cntlist file, and in any case we
considered at most the first six senses of
a word.

• Enriching sense characteristic vectors

The relevance filter provided very good
results in our experiments with SemCor
and Senseval-1 data as far as precision is
concerned, but the problem is that there
is little overlap between the
definitions of the senses and the contexts in
terms of co-occurrence (after removing stop
words and computing idf), which means
that the previous heuristic didn't
disambiguate many words.

To overcome this problem, we enrich the
sense characteristic vectors, adding for
each word in the vector the words related
to it via the relevance matrix weights.
This corresponds to the algebraic notion
of multiplying the matrix and the
characteristic vector. In other words, if R is the
relevance matrix and v our characteristic
vector, we would finally use Rv + v.

This should increase the number of words
disambiguated, provided we eliminate the
idf factor (which would be zero in most
cases because now the sense characteristic
vectors are not as sparse as before).
When we also discard senses with low
relative frequency in SemCor we call this
heuristic the mixed filter.

• Back-off strategies

For those cases that couldn't be covered
by other heuristics we employed the first
sense heuristic. In the case of the
supervised system for the English lexical
sample task we thought of using the most
frequent sense but didn't implement it due
to lack of time.
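
The cascade described in this section can be sketched in Python as below. This is a sketch under assumptions, not the small language mentioned above: heuristics are modelled as functions that return a sense or `None`, the sense labels such as `plant#1` are hypothetical, and `sense_counts` stands in for counts read from cntlist.

```python
def monosemous(word, senses, context):
    """Answer only when the word has exactly one sense."""
    return senses[0] if len(senses) == 1 else None

def make_statistical_filter(sense_counts, min_prob=0.10):
    """Build a heuristic that drops senses whose relative frequency is below min_prob."""
    def statistical_filter(word, senses, context):
        total = sum(sense_counts.get(s, 0) for s in senses)
        if total == 0:
            return None
        kept = [s for s in senses if sense_counts.get(s, 0) / total >= min_prob]
        # Decide only when a single sense survives; otherwise pass the word on.
        return kept[0] if len(kept) == 1 else None
    return statistical_filter

def first_sense(word, senses, context):
    """Back-off heuristic: always pick the first listed sense."""
    return senses[0] if senses else None

def cascade(heuristics, word, senses, context):
    """Try each heuristic in order; return (sense, heuristic name) for the first answer."""
    for heuristic in heuristics:
        sense = heuristic(word, senses, context)
        if sense is not None:
            return sense, heuristic.__name__
    return None, None

# Hypothetical sense counts standing in for the cntlist file.
sense_counts = {"plant#1": 95, "plant#2": 5, "bank#1": 60, "bank#2": 40}
heuristics = [monosemous, make_statistical_filter(sense_counts), first_sense]

plant = cascade(heuristics, "plant", ["plant#1", "plant#2"], [])
bank = cascade(heuristics, "bank", ["bank#1", "bank#2"], [])
oxygen = cascade(heuristics, "oxygen", ["oxygen#1"], [])
```

Returning the name of the heuristic that answered makes it straightforward to tabulate attempts and scores per heuristic.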

4 Systems and Results 

• UNED-AW-U2

We won't delve into the UNED-AW-U system
as it is very similar to this one. This is
an (arguably) unsupervised system for the
English all words task. The heuristics we
used and the results obtained for each of
them are shown in Table 1.



Heuristic            Att.   Score    Prec    Rec
Monosemous exp        514   45500   88.5%  18.4%
Statistical filter    350   27200   77.7%  11.0%
Mixed filter         1256   50000   39.8%  20.2%
Enriched Senses        77    4300   55.8%   3.1%
First sense           249   13600   54.6%   5.5%
Total                2446  140600   57.5%  56.9%

Table 1: Unsupervised heuristics for English
all words task



If the individual heuristics were used as
standalone WSD systems, we would
obtain the results in Table 2.



System               Att.    Score    Prec  Recall
First sense          2405   146900   61.1%   59.4%
UNED-AW-U2           2446   140600   57.5%   56.9%
Mixed filter         2120   122600   57.8%   49.6%
Enriched senses      2122   108100   50.9%   43.7%
Random               2417  89191.2   36.9%   36.0%
Statistical filter    864    72700   84.1%   29.4%

Table 2: UNED-AW-U2 vs baselines
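
The relation between the Att., Score and Prec columns can be checked with a few lines of Python. The sketch rests on two assumptions that the text does not state: that Score carries 100 points per correct answer (which is consistent with every row), and that the all words test set has 2473 instances, a figure chosen here because it reproduces the reported overall recall.

```python
def percent(score, denominator):
    """Score is assumed to carry 100 points per correct answer,
    so the plain ratio is already a percentage."""
    return round(score / denominator, 1)

# (heuristic, attempted, score) rows copied from Table 1
table1 = [
    ("Monosemous exp", 514, 45500),
    ("Statistical filter", 350, 27200),
    ("Mixed filter", 1256, 50000),
    ("Enriched Senses", 77, 4300),
    ("First sense", 249, 13600),
]
TOTAL_INSTANCES = 2473  # assumption: not stated in the text

attempted = sum(att for _, att, _ in table1)
score = sum(sc for _, _, sc in table1)
precision = percent(score, attempted)       # score over attempted instances
recall = percent(score, TOTAL_INSTANCES)    # score over all instances
per_heuristic = {name: percent(sc, att) for name, att, sc in table1}
```

Precision divides by the instances a heuristic attempted, recall by all instances in the task, which is why a heuristic that answers rarely can combine high precision with low recall.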



In the lexical sample task, we weren't able to
multiply by the relevance matrix due to time
constraints, so in order to increase the
coverage of the relevance filter heuristic we
expanded the definitions of the senses with those
of the first 5 levels of hyponyms. Also, we
selected the radius of the context to be
considered depending on the POS of the word being
disambiguated: for nouns and verbs we used a
neighbourhood of radius 25 words, and for
adjectives 5 words on each side.

• UNED-LS-U

This is essentially the
same system as UNED-AW-U2, in this
case applied to the lexical sample task.
The results are displayed in Table 3.

Heuristic          Att.   Score    Prec  Recall
Relevance filter   3039  113617   37.3%   26.2%
First sense        1285   60000   46.7%   13.9%
Total              4324  173617   40.2%   40.2%

Table 3: Unsupervised heuristics for English
lexical sample task

• UNED-LS-T

This is a supervised variant of the
previous systems. We have added the training
examples to the definitions of the senses,
giving the same weight to the definition
and to all the examples as a whole (i.e.
definitions are considered more
interesting than examples).



Heuristic          Att.   Score    Prec  Recall
Relevance filter   4116  206150   50.1%   47.6%
First sense         208    9300   44.7%    2.1%
Total              4324  215450   49.8%   49.8%

Table 4: Supervised heuristics for English
lexical sample task



5 Discussion and conclusions 

We've put a lot of effort into building the
relevance matrix, but its performance in the WSD
task is not striking. The matrix is interesting, and
its application in the relevance filter
heuristic is slightly better than simple co-occurrence
counting, which suggests that it doesn't discard
relevant words. The problem seems to lie in
the fact that irrelevant words (with respect to
the word to be disambiguated) rarely occur
both in the context of the word and in the
definition of the senses (if they appeared in
the definition they wouldn't be so irrelevant),
so the direct impact of the information in the
matrix is very weak. Likewise, words that are
relevant (via the matrix) to the word to
be disambiguated often occur both in the
context and in the definitions, so the final result is
very similar to simple co-occurrence counting.
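
The comparison can be made concrete with a small Python sketch that computes both the relevance-weighted score of Section 3 and plain co-occurrence counting on the same input. The glosses, context, sense labels and relevance values below are invented for illustration, and the function names are hypothetical.

```python
import math

def idf(word, sense_glosses):
    """log(N / d_w): N senses in total, d_w glosses containing `word`."""
    d_w = sum(1 for gloss in sense_glosses.values() if word in gloss)
    return math.log(len(sense_glosses) / d_w) if d_w else 0.0

def relevance_score(target, sense, context, sense_glosses, relevance):
    """Sum over context words of R[w, target] * freq(w, C) * freq(w, S) * idf(w, target)."""
    gloss = sense_glosses[sense]
    score = 0.0
    for w in set(context):
        # The matrix is symmetric, so look the pair up in either order.
        weight = relevance.get((w, target), relevance.get((target, w), 0.0))
        score += weight * context.count(w) * gloss.count(w) * idf(w, sense_glosses)
    return score

def overlap_count(sense, context, sense_glosses):
    """Simple co-occurrence counting: no relevance weight, no idf."""
    gloss = sense_glosses[sense]
    return sum(context.count(w) * gloss.count(w) for w in set(context))

# Invented toy data for the word "bank".
sense_glosses = {
    "bank#1": ["financial", "institution", "money", "deposit"],
    "bank#2": ["slope", "land", "river", "water"],
}
relevance = {("bank", "river"): 3.0, ("bank", "water"): 2.5, ("bank", "money"): 4.0}
context = ["fish", "river", "water", "cold"]

weighted = {s: relevance_score("bank", s, context, sense_glosses, relevance)
            for s in sense_glosses}
counted = {s: overlap_count(s, context, sense_glosses) for s in sense_glosses}
best_weighted = max(weighted, key=weighted.get)
best_counted = max(counted, key=counted.get)
```

Both rankings select the same sense here, because the only words shared by context and gloss are already relevant to the target word. This is the situation described above in which the weights change the scores but not the decision.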

This problem only showed up in the lexical
sample task systems. In the all words systems
we were able to enrich the sense definitions to make
more advantageous use of the matrix.

We were very confident that the relevance
filter would yield good results, as we had
already evaluated it against the Senseval-1
and SemCor data. We felt, however, that
we could improve the coverage of the
heuristic by enriching the definitions, multiplying by
the matrix. A similar approach was used
by Yarowsky (Yarowsky, 1992) and Schütze
(Schütze and Pedersen, 1995) and it worked
for them. This wasn't the case for us; still,
we think it is well worth researching
other ways of using the resource.
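
The enrichment step, Rv + v, can be sketched in Python with sparse dictionaries. The representation (a dictionary of unordered pairs for R and a dictionary of word weights for v), the function name `enrich`, and the toy values are assumptions made for illustration.

```python
def enrich(vector, relevance):
    """Return R v + v for a sparse vector {word: weight} and a sparse symmetric R
    stored as {(a, b): value} with each unordered pair listed once."""
    enriched = dict(vector)  # the "+ v" term
    for (a, b), r in relevance.items():
        # R is symmetric, so each stored pair contributes in both directions.
        if b in vector:
            enriched[a] = enriched.get(a, 0.0) + r * vector[b]
        if a in vector and a != b:
            enriched[b] = enriched.get(b, 0.0) + r * vector[a]
    return enriched

relevance = {("bank", "river"): 3.0, ("river", "water"): 2.0, ("bank", "money"): 4.0}
gloss_vector = {"river": 1.0, "slope": 1.0}
enriched = enrich(gloss_vector, relevance)
```

The gloss vector gains the words "bank" and "water" through their relevance to "river", so a context mentioning either of them can now overlap with the sense even though the gloss itself never uses those words.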

As for the overall scores, the unsupervised
lexical sample system obtained the highest recall of
the unsupervised systems, which suggests that
carefully implementing simple techniques still
pays off. In the all words task, UNED-AW-U2
also had the highest recall among the
unsupervised systems (as characterized in the
Senseval-2 web descriptions), and the fourth
overall. We'll train it with the examples in
SemCor 1.6 and see how much we can gain.

6 Conclusions 

Our system scored first among unsupervised
systems in both tasks: 56.9% recall in all
words, 40.2% in lexical sample. This is slightly
worse than the first sense heuristic for all
words and 3.6% better for the lexical sample,
a strong indication that unsupervised Word
Sense Disambiguation remains a serious
challenge.